doi: 10.17586/2226-1494-2020-20-4-545-551


NORMALIZATION OF KAZAKH LANGUAGE WORDS

D. R. Rakhimova, A. O. Turganbaeva


Article in Russian

For citation:
Rakhimova D.R., Turganbaeva A.O. Normalization of Kazakh language words. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2020, vol. 20, no. 4, pp. 545–551 (in Russian). doi: 10.17586/2226-1494-2020-20-4-545-551


Abstract
Subject of Research. The paper considers existing models and algorithms for normalizing natural language words. It describes algorithms for automatic extraction of word stems in a number of natural languages and possible ways to synthesize the normal word form for the Kazakh language. The research aims to build a complete classification of the Kazakh ending system and to develop a word normalization algorithm based on the proposed classification of endings and suffixes. Method. Word formation by attaching endings was analyzed for all parts of speech of the Kazakh language, and a classification of endings and suffixes was produced. The paper discusses all possible placement options for endings and suffixes. In total there are 26 526 suffix variants and 3 565 ending variants. All considered variants are lexically and semantically valid, but some of them are not used in practice, so only the most commonly used ones are added to the affix base. The order in which affixes attach to the stem is described using sets, which allows the stem to be identified correctly. The study does not examine word-forming suffixes, because they change the word stem and its contextual interpretation; such suffixes are mostly added to nouns. Main Results. A complete classification system for endings and suffixes of the Kazakh language has been developed. Deterministic finite automata have been built for the various parts of speech; they cover all possible ways of attaching suffixes and endings and take into account the morphological and lexical features of Kazakh grammar. A lexicon-free stemming algorithm has been developed on the basis of the proposed classification of Kazakh endings. A normalization system has been implemented, demonstrating that the developed algorithm works without a dictionary. The implementation was tested on a Kazakh language corpus from which punctuation and stop words had first been removed. Practical Relevance. The results can be applied in text analysis and normalization (lemmatization), in information retrieval systems, in machine translation from the Kazakh language, and in other applied problems.
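The idea of lexicon-free stemming with an ordered affix classification can be illustrated with a minimal sketch. This is not the authors' implementation: the affix sets below are a tiny hand-picked subset of Kazakh noun endings (plural and a few case endings), the stop-word list and the names `stem`, `normalize`, and `MIN_STEM_LEN` are hypothetical, and the fixed stripping order stands in for the deterministic finite automata mentioned in the abstract.

```python
import re

# Assumption: a very small subset of Kazakh noun endings, for illustration only.
# The affix base described in the paper is far larger.
PLURAL = ("лар", "лер", "дар", "дер", "тар", "тер")
CASE = (
    "ның", "нің", "дың", "дің", "тың", "тің",  # genitive
    "ға", "ге", "қа", "ке",                    # dative
    "да", "де", "та", "те",                    # locative
    "ны", "ні", "ды", "ді", "ты", "ті",        # accusative
)

# Hypothetical stop-word list; a real system would use a curated one.
STOP_WORDS = {"мен", "және", "бұл"}

# Assumption: never leave a stem shorter than this many letters.
MIN_STEM_LEN = 2


def strip_one(word, affixes):
    """Remove at most one affix from the end of the word, longest match first."""
    for affix in sorted(affixes, key=len, reverse=True):
        if word.endswith(affix) and len(word) - len(affix) >= MIN_STEM_LEN:
            return word[: -len(affix)]
    return word


def stem(word):
    """Strip endings in reverse attachment order: case first, then plural."""
    word = word.lower()
    word = strip_one(word, CASE)
    word = strip_one(word, PLURAL)
    return word


def normalize(text):
    """Drop punctuation and stop words, then stem each remaining token."""
    tokens = re.findall(r"\w+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]


print(normalize("Кітаптар мен дәптерлер мектептерде."))
```

Because no dictionary is consulted, a sketch like this can overstem: the bare noun "дәптер" (notebook) ends in the letters "тер" and would be cut to "дәп". Limiting which affix sequences are admissible, as the classification and automata described in the abstract do, is one way to reduce such errors.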

Keywords: natural language processing, Kazakh, ending system, normalization, stemming algorithm

Acknowledgements. The study was supported by the Ministry of Education and Science of the Republic of Kazakhstan within the framework of the AP05132950 scientific project.




This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License
